AlerTiger: Deep Learning for AI Model Health Monitoring at LinkedIn
Data-driven companies use AI models extensively to develop products and
intelligent business solutions, making the health of these models crucial for
business success. Model monitoring and alerting in industry pose unique
challenges, including the lack of a clear definition of model health metrics, label
sparsity, and fast model iterations that result in short-lived models and
features. As a product, a monitoring system must also be scalable,
generalizable, and explainable. To tackle these challenges, we propose
AlerTiger, a deep-learning-based MLOps model monitoring system that helps AI
teams across the company monitor their AI models' health by detecting anomalies
in models' input features and output scores over time. The system consists of
four major steps: model statistics generation, deep-learning-based anomaly
detection, anomaly post-processing, and user alerting. Our solution generates
three categories of statistics to indicate AI model health, offers a two-stage
deep anomaly detection solution that addresses label sparsity and
generalizes to monitoring new models, and provides holistic reports for
actionable alerts. This approach has been deployed to most of LinkedIn's
production AI models for over a year and has identified several model issues
whose fixes later led to significant business metric gains.
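
The four steps named above (statistics generation, anomaly detection, post-processing, user alerting) can be illustrated with a minimal sketch. This is not the AlerTiger implementation: the abstract does not specify the statistics or the deep model, so the daily mean and a trailing-window z-score are used here as plainly labeled stand-ins, and every function name, parameter, and threshold below is hypothetical.

```python
from statistics import mean, stdev


def generate_statistics(values_by_day):
    """Step 1 (stand-in): reduce each day's raw values to one statistic, the mean."""
    return [mean(values) for values in values_by_day]


def detect_anomalies(series, window=5, threshold=3.0):
    """Step 2 (stand-in for the deep model): flag days whose value lies more
    than `threshold` standard deviations from the trailing-window mean."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            is_anomaly = series[i] != mu
        else:
            is_anomaly = abs(series[i] - mu) / sigma > threshold
        if is_anomaly:
            flagged.append(i)
    return flagged


def post_process(flagged):
    """Step 3 (assumed behavior): merge consecutive flagged days into
    (start_day, end_day) incidents so one issue yields one alert."""
    incidents = []
    for idx in flagged:
        if incidents and idx == incidents[-1][1] + 1:
            incidents[-1] = (incidents[-1][0], idx)
        else:
            incidents.append((idx, idx))
    return incidents


def build_alerts(model_name, feature_name, incidents):
    """Step 4: turn incidents into human-readable alert messages."""
    return [
        f"[{model_name}] '{feature_name}' anomalous from day {start} to day {end}"
        for start, end in incidents
    ]


if __name__ == "__main__":
    # Hypothetical feature values: six stable days, then a sudden shift.
    raw = [[1.0], [1.1], [0.9], [1.0], [1.05], [1.0], [5.0], [5.1]]
    stats = generate_statistics(raw)
    incidents = post_process(detect_anomalies(stats))
    for line in build_alerts("example_model", "example_feature", incidents):
        print(line)
```

Note that a trailing-window baseline absorbs the shift after its first day, so only the onset is flagged; a production detector would need a more robust baseline, which is one motivation for a learned model.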